[W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. (#33892)
Conversation
Signed-off-by: maral <maralbahari.98@gmail.com>
Code Review
This pull request introduces a significant and well-designed refactoring of the FP8 block-scaled linear kernel integration. By removing the monolithic `W8A8BlockFp8LinearOp` and introducing a new kernel abstraction layer with `MMLinearKernel`, the code becomes much more modular, maintainable, and extensible. The new kernel selection mechanism in `init_fp8_linear_kernel` is clear and correctly dispatches to different kernel implementations based on the quantization configuration. The changes are consistently applied across benchmarks, tests, and model implementation files.
I've found a few issues, including a critical one that would cause a runtime error, and a couple of high-severity issues related to correctness in tests and code robustness. After addressing these, this PR will be a great improvement to the codebase.
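To make the selection mechanism concrete, here is a minimal sketch of a config-driven kernel dispatch function. The class names, config fields, and selection criterion below are illustrative assumptions for this sketch, not the PR's actual code:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the PR's block-scaled FP8 kernel classes.
class CutlassBlockFp8Kernel: ...
class TritonBlockFp8Kernel: ...

@dataclass
class QuantConfig:
    # e.g. (128, 128) for 128x128 block-scaled FP8 weights
    weight_block_size: tuple
    use_cutlass: bool = True

def init_fp8_linear_kernel(cfg: QuantConfig):
    """Pick a block-scaled FP8 kernel from the quantization config:
    prefer the fast backend when enabled, else a portable fallback."""
    if cfg.use_cutlass:
        return CutlassBlockFp8Kernel
    return TritonBlockFp8Kernel

print(init_fp8_linear_kernel(QuantConfig((128, 128))).__name__)
# -> CutlassBlockFp8Kernel
print(init_fp8_linear_kernel(QuantConfig((128, 128), use_cutlass=False)).__name__)
# -> TritonBlockFp8Kernel
```

The point of centralizing the decision in one function is that callers never hard-code a backend; they hand over the quantization config and receive whichever kernel class is valid for it.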
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Maral <maralbahari.98@gmail.com>
…r.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Maral <maralbahari.98@gmail.com>
…kScaledMMLinearKernel.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
…block-scaled-rfc-pr Signed-off-by: maral <maralbahari.98@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
…ement for cutlass and fix type error in dynamic deepgemm/flash-infer Signed-off-by: maral <maralbahari.98@gmail.com>
…block-scaled-rfc-pr
Signed-off-by: maral <maralbahari.98@gmail.com>
Hi @maralbahari, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
…pt Fp8 block linear kernel selections. (vllm-project#33892) Signed-off-by: maral <maralbahari.98@gmail.com> Signed-off-by: Maral <maralbahari.98@gmail.com>
…pt Fp8 block linear kernel selections. (vllm-project#33892) Signed-off-by: maral <maralbahari.98@gmail.com> Signed-off-by: Maral <maralbahari.98@gmail.com> Signed-off-by: jackcfwang <jackcfwang@tencent.com>
Required by NvFp4LinearKernel refactor (vllm-project#39129). Copied from upstream/main rather than cherry-picking the full W8A8 block linear refactor (vllm-project#33892, 35 files). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…selections (vllm-project#33892) Cherry-picked from upstream vllm-project/vllm@2e9034c99. Required dependency for NvFp4LinearKernel refactor (vllm-project#39129) — provides base.py, block-scaled kernel classes, and updated FP8 utils. Also synced nvfp4_emulation_utils.py for kE2M1ToFloat_handle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous cherry-pick of vllm-project#33892 overwrote NVFP4 exports from vllm-project#39129. Synced to upstream/main which has both FP8 block and NVFP4 kernel exports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stream regressions in attention, FP8, offloading and platform (#1338)

## Summary

Fixes five regressions introduced by recent upstream vLLM changes that break HPU unit tests and model execution.

## Changes

1. **Remove `use_output` guard from HPU attention patch** — attribute removed upstream
2. **Remove `accept_output_buffer` branching from HPU MLA attention** — attribute removed upstream; unconditionally use output buffer in opaque path, direct call path manages output internally
3. **Update KV offloading connector tests** — field renames: `block_hashes` → `keys`, `block_hashes_to_store` → `keys_to_store`, config access via `kv_group_configs[0]`
4. **Register HPU FP8 block-scaled kernel + add ops test conftest** — new `_POSSIBLE_FP8_BLOCK_KERNELS` dict needs OOT entry; provide `VllmConfig` stub for ops unit tests
5. **Add `manual_seed_all` to `HpuPlatform`** — new required platform method for RNG seeding

## Upstream PRs that introduced these regressions

- vllm-project/vllm#39125 — removed `accept_output_buffer` and `use_output` from attention layer (fixes 1, 2)
- vllm-project/vllm#37109 — restructured `OffloadingConnectorScheduler` API (fix 3)
- vllm-project/vllm#33892 — added `model_config.dtype` access in `Fp8LinearMethod.__init__` and `_POSSIBLE_FP8_BLOCK_KERNELS` (fix 4)
- vllm-project/vllm#38468 — added `manual_seed_all` as required abstract method on `Platform` (fix 5)

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Purpose

This PR refactors the block-scaled linear kernel into a kernel abstraction.

Changes:
- `MMLinearKernel`: base interface for all linear kernels.
- `Params`, `Fp8Params` and `Int8Params`: classes to access layer params in a structured format.
- `DynamicMMLinearKernel`: a type of `MMLinearKernel` with two main properties, a base kernel and a fallback kernel, both variants of `MMLinearKernel`; this class switches between the base and fallback implementations at runtime.
- Removes the `W8A8BlockFp8LinearOp` class.

Test Plan

CUDA platform:
run CI/CD tests.
ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block
Test Result
ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block, without AITER
W8A8 Block Linear Refactor PRs:
- #33047: Moves all the quantization ops into the same `QuantFP8` class. (merged)
- This PR: Removes the `W8A8Fp8BlockLinearOp` class and updates all code paths and files that use this class.
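The runtime base/fallback switching that `DynamicMMLinearKernel` introduces can be sketched as below. The method names, the exception-based handoff, and the toy shape restriction are illustrative assumptions for this sketch, not the PR's actual implementation:

```python
class MMLinearKernel:
    """Minimal stand-in for the base kernel interface from the PR."""
    def apply(self, x):
        raise NotImplementedError

class ShapeLimitedKernel(MMLinearKernel):
    """Toy 'base' kernel that only supports batch sizes divisible by 8."""
    def apply(self, x):
        if len(x) % 8 != 0:
            raise NotImplementedError("unsupported batch size")
        return [v * 2 for v in x]

class FallbackKernel(MMLinearKernel):
    """Portable kernel that handles any input."""
    def apply(self, x):
        return [v * 2 for v in x]

class DynamicMMLinearKernel(MMLinearKernel):
    """Holds a base and a fallback kernel (both MMLinearKernel variants)
    and switches between them at runtime when the base cannot run."""
    def __init__(self, base, fallback):
        self.base = base
        self.fallback = fallback

    def apply(self, x):
        try:
            return self.base.apply(x)
        except NotImplementedError:
            return self.fallback.apply(x)

k = DynamicMMLinearKernel(ShapeLimitedKernel(), FallbackKernel())
print(k.apply([1.0] * 8))  # handled by the base kernel
print(k.apply([1.0] * 3))  # base refuses, fallback takes over
```

Both calls return the same numeric result; the wrapper only decides which implementation computes it, which is exactly the property that makes the fallback transparent to callers.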